Abstract



Scene parsing, or semantic segmentation, consists of labeling
each pixel in an image with the category of the object
it belongs to. It is a challenging task that involves the
simultaneous detection, segmentation and recognition of all the
objects in the image.

The scene parsing method proposed here starts by com- 
puting a tree of segments from a graph of pixel dissimilari- 
ties. Simultaneously, a set of dense feature vectors is com- 
puted which encodes regions of multiple sizes centered on 
each pixel. The feature extractor is a multiscale convolu- 
tional network trained from raw pixels. The feature vec- 
tors associated with the segments covered by each node in 
the tree are aggregated and fed to a classifier which pro- 
duces an estimate of the distribution of object categories 
contained in the segment. A subset of tree nodes that cover
the image is then selected so as to maximize the average
"purity" of the class distributions, hence maximizing
the overall likelihood that each segment will contain a single
object. The convolutional network feature extractor is
trained end-to-end from raw pixels, alleviating the need for
engineered features. After training, the system is parameter
free.

The system yields record accuracies on the Stanford
Background Dataset (8 classes), the SIFT Flow Dataset (33
classes) and the Barcelona Dataset (170 classes) while
being an order of magnitude faster than competing
approaches, producing a 320 x 240 image labeling in less
than 1 second.



1. Overview 

Full scene labeling (FSL) is the task of labeling each 
pixel in a scene with the category of the object to which it 
belongs. FSL requires solving the detection, segmentation,
recognition and contextual integration problems simultaneously,
so as to produce a globally consistent labeling. One
of the obstacles to FSL is that the information necessary for 
the labeling of a given pixel may come from very distant 
pixels as well as their labels. The category of a pixel may 
depend on relatively short-range information (e.g. the pres- 
ence of a human face generally indicates the presence of a 
human body nearby), as well as on very long-range depen- 
dencies (is this grey pixel part of a road, a building, or a 
cloud?). 

This paper proposes a new method for FSL, depicted in
Figure 1, that relies on five main ingredients: 1) Trainable,
dense, multi-scale feature extraction: a multi-scale, dense 
feature extractor produces a series of feature vectors for re- 
gions of multiple sizes centered around every pixel in the 
image, covering a large context. The feature extractor is 
a two-stage convolutional network applied to a multi-scale 
contrast-normalized Laplacian pyramid computed from the
image. The convolutional network is fed with raw pix- 
els and trained end to end, thereby alleviating the need for 
hand-engineered features; 2) Segmentation Tree: A graph 
over pixels is computed in which each pixel is connected to 
its 4 nearest neighbors through an edge whose weight is a 
measure of dissimilarity between the colors of the two pix- 
els. A segmentation tree is then constructed using a classical 
region merging method, based on the minimum spanning 
tree of the graph. Each node in the tree corresponds to a po- 
tential image segment. The final image segmentation will 
be a judiciously chosen subset of nodes of the tree whose 
corresponding regions cover the entire image; 3) Region-wise
feature aggregation: for each node in the tree, the
corresponding image segment is encoded by a 5 x 5 spatial
grid of aggregated feature vectors. The aggregated feature
vector of each grid cell is computed by a component-wise
max pooling of the feature vectors centered on all the pixels
that fall into the grid cell; this produces a scale-invariant
representation of the segment and its surroundings; 4) Class
histogram estimation: a classifier is then applied to the ag- 
gregated feature grid of each node. The classifier is trained 
to estimate the histogram of all object categories present in 
its input segments; 5) Optimal purity cover: a subset of 
tree nodes is selected whose corresponding segments cover 
the entire image. The nodes are selected so as to minimize 
the average "impurity" of the class distribution. The class 
"impurity" is defined as the entropy of the class distribution. 
The choice of the cover thus attempts to find a consistent 
overall segmentation in which each segment contains pixels 
belonging to only one of the learned categories. 
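
The segmentation tree of ingredient 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a Kruskal-style merge over the 4-connected pixel graph, uses the absolute difference of grayscale intensities as a stand-in for the color dissimilarity, and encodes the tree as a parent array. The name `build_segmentation_tree` is hypothetical.

```python
import numpy as np

def build_segmentation_tree(image):
    """Merge pixels along minimum-spanning-tree edges (Kruskal order).

    image: 2-D array of intensities (assumed stand-in for color).
    Returns (parent, n_leaves): parent[n] is the tree parent of node n,
    or -1 for the root. Leaves 0..H*W-1 are pixels in row-major order;
    every internal node is a candidate image segment.
    """
    h, w = image.shape
    n_leaves = h * w
    edges = []
    for y in range(h):
        for x in range(w):
            p = y * w + x
            v = float(image[y, x])
            if x + 1 < w:   # edge to the right neighbor
                edges.append((abs(v - float(image[y, x + 1])), p, p + 1))
            if y + 1 < h:   # edge to the bottom neighbor
                edges.append((abs(v - float(image[y + 1, x])), p, p + w))
    edges.sort()            # increasing dissimilarity

    parent = [-1] * n_leaves       # tree parent of every node
    rep = list(range(n_leaves))    # union-find representative per pixel
    top = list(range(n_leaves))    # current tree node of each union-find set

    def find(a):
        while rep[a] != a:
            rep[a] = rep[rep[a]]   # path halving
            a = rep[a]
        return a

    for _, a, b in edges:
        ra, rb = find(a), find(b)
        if ra == rb:
            continue               # edge would close a cycle, skip it
        node = len(parent)         # new internal node = merged segment
        parent.append(-1)
        parent[top[ra]] = node
        parent[top[rb]] = node
        rep[rb] = ra
        top[ra] = node
    return parent, n_leaves
```

With N pixels the tree has N leaves and N - 1 internal nodes; the cost is dominated by sorting the edges, which matches the quasi-linear complexity claimed for tree construction.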

All the steps in the process have a complexity linear (or 
almost linear) in the number of pixels. The bulk of the
computation resides in the convolutional network feature
extractor. The resulting system is very fast, producing a full parse
of a 320 x 240 image in less than 1 second on a conventional
CPU. Once trained, the system is parameter free, and
requires no adjustment of thresholds or other knobs. 

There are three key contributions in this paper: 1) using
a multi-scale convolutional net to learn good features for
region classification; 2) using a class purity criterion to
decide if a segment contains a single object, as opposed to
several objects, or part of an object; 3) an efficient procedure
to obtain a cover that optimizes the overall class purity of a
segmentation.

2. Related work 

The problem of scene parsing has been approached with 
a wide variety of methods in recent years. Many methods 
rely on MRFs, CRFs, or other types of graphical models 
to ensure the consistency of the labeling and to account 
for context [Q, 22, 6, 13, 17, 24]. Most methods rely on 
a pre-segmentation into super-pixels or other segment can- 
didates [6, 13, 17, 24], and extract features and categories 
from individual segments and from various combinations of 
neighboring segments. The graphical model inference pulls 
out the most consistent set of segments that cover the image. 

Socher et al. [ ] propose a method to aggregate seg- 
ments in a greedy fashion using a trained scoring function. 
The originality of the approach is that the feature vector of 
the combination of two segments is computed from the fea- 
ture vectors of the individual segments through a trainable 
function. Like us, they use "deep learning" methods to train 
their feature extractor. But unlike us, their feature extractor 
operates on hand-engineered features. 

Figure 1. Diagram of the scene parsing system. The raw input
image is transformed through a Laplacian pyramid. Each scale is
fed to a 2-stage convolutional network, which produces a set of
feature maps. The feature maps of all scales are concatenated, the
coarser-scale maps being upsampled to match the size of the
finest-scale map. Each feature vector thus represents a large contextual
window around each pixel. In parallel, a segmentation tree is
computed via the minimum spanning tree of the dissimilarity graph of
neighboring pixels. The segment associated with each node in the
tree is encoded by a spatial grid of feature vectors pooled in the
segment's region. A classifier is then applied to all the aggregated
feature grids to produce a histogram of categories, the entropy of
which measures the "impurity" of the segment. Each pixel is then
labeled by the minimally-impure node above it, which is the
segment that best "explains" the pixel.

One of the main questions in scene parsing is how to
take a wide context into account to make a local decision.
Munoz et al. [ ] proposed to use the histogram of labels
extracted from a coarse scale as input to the labeler that
looks at finer scales. Our approach is somewhat simpler: our
feature extractor is applied densely to an image pyramid.
The coarse feature maps thereby generated are upsampled
to match the size of the finest-scale map. Hence with three scales,
each feature vector has multiple fields which encode
multiple regions of increasing sizes and decreasing resolutions,
centered on the same pixel location.

Like us, a number of authors have used trees to generate 
candidate segments by aggregating elementary segments, as 
in [22]. Using trees makes it possible to rely on fast inference
algorithms based on graph cuts or other methods. In this paper,
we use an innovative method based on finding a set of tree
nodes that cover the image while minimizing some
criterion.

Our system extracts features densely from a multiscale 
pyramid of images using a convolutional network
(ConvNet) [ ]. ConvNets can be fed with raw pixels and can
automatically learn low-level and mid-level features, alle- 
viating the need for hand-engineered features. One big ad- 
vantage of ConvNets is the ability to compute dense features 
efficiently over large images. ConvNets are best known for 
their applications to detection and recognition [20, 11], but 
they have also been used for image segmentation,
particularly for biological image segmentation [9, 10, 25].

The only published work on using ConvNets for scene
parsing is that of Grangier et al. [ ]. While somewhat
preliminary, their work showed that convolutional networks fed
with raw pixels could be trained to perform scene parsing
with decent accuracy. Unlike [ ], however, our system uses
a boundary-based over-segmentation to align the labels
produced by the ConvNet to the boundaries in the image. Our
system also takes advantage of the boundary-based
over-segmentation to produce representations that are
independent of the size of the segment through feature pooling.

3. An end-to-end trainable model for scene parsing

The model proposed in this paper, depicted in Figure 1,
relies on two complementary image representations. In
the first representation, the image is seen as a point in a
high-dimensional space, and we seek a transform $f$ that maps
these images into a space in which each pixel can be assigned
a label using a simple linear classifier. This first
representation typically suffers from two main problems:
(1) the window considered rarely contains an object that is
properly centered and scaled, and therefore offers a poor
observation basis to predict the class of the underlying
object; (2) integrating a large context involves increasing
the grid size, and therefore the dimensionality P of the
input; given a finite amount of training data, it is then
necessary to enforce some invariance in the function $f$
itself. This is usually achieved by using pooling/subsampling
layers, which in turn degrade the ability of the model to
precisely locate and delineate objects. In this paper, $f$ is
implemented by a multiscale convolutional network, which
allows integrating large contexts (as large as the complete
scene) into local decisions while remaining manageable in
terms of parameters/dimensionality. This multiscale model, in
which weights are shared across scales, allows the model to
capture long-range interactions without the penalty of extra
parameters to train. This model is described in Section 3.1.

In the second representation, the image is seen as an 
edge-weighted graph, on which a hierarchy of segmenta- 
tions/clusterings can be constructed. This representation 
yields a natural abstraction of the original pixel grid, and 
provides a hierarchy of observation levels for all the objects 
in the image. It can be used as a solution to the first prob- 
lem exposed above: assuming the capability of assessing 
the quality of all the components of this hierarchy, a system 
can automatically choose its components so as to produce 
the best set of predictions. Moreover, these components 
are spatially accurate, and naturally delineate the underly- 
ing objects, as this representation conserves pixel-level pre- 
cision. Section 3.2 describes our methodology. 

3.1. Scale-invariant, scene-level feature extraction 

Our feature extractor is based on a convolutional net- 
work. Convolutional networks are natural extensions of 
neural networks, in which weights are replicated over space, 
or in other terms the linear transforms are done using 2D 
convolutions. A convolution can be seen as a linear trans- 
form with shared (replicated) weights. The use of weight 
sharing is justified by the fact that image statistics are sta- 
tionary, and features and combinations of features that are 
relevant in one region of an image are also relevant in other 
regions. In fact, by enforcing this constraint, each layer of a 
convolutional network is explicitly forced to model features 
that are shift-equivariant. Because of the imposed weight- 
sharing, convolutional networks have been used success- 
fully for a number of image labeling problems. 

More holistic tasks, such as full-scene understanding 
(pixel-wise labeling, or any dense feature estimation) re- 
quire the system to model complex interactions at the scale 
of complete images, not simply within a patch. In this prob- 
lem the dimensionality becomes unmanageable: for a typi- 
cal image of 256 x 256 pixels, a naive neural network would 
require millions of parameters, and a naive convolutional 
network would require filters that are unreasonably large to 
view enough context. 

Our multiscale convolutional network overcomes these 
limitations by extending the concept of weight replication 
to the scale space. Given an input image $I$, a multiscale
pyramid of images $X_s$, $\forall s \in \{1, \dots, N\}$, is constructed,
with $X_1$ being the size of $I$. The multiscale pyramid can
be a Laplacian pyramid, and is typically pre-processed so
that local neighborhoods have zero mean and unit standard
deviation. Given a classical convolutional network $f_s$ with
parameters $\theta_s$, the multiscale network is obtained by
instantiating one network per scale $s$, and sharing all parameters
across scales: $\theta_s = \theta_0$, $\forall s \in \{1, \dots, N\}$.

More precisely, the input at each scale is computed using
the scaling/normalizing function $g_s$ as $X_s = g_s(I)$ for all
$s \in \{1, \dots, N\}$. The convolutional network $f_s$ can then be
described as a sequence of linear transforms, interspersed
with non-linear symmetric squashing units (typically the
tanh function): $F_s = f_s(X_s; \theta_s) = W_L H_{L-1}$, with
$H_l = \tanh(W_l H_{l-1} + b_l)$ for all $l \in \{1, \dots, L-1\}$, where
$H_l$ is the vector of hidden units at layer $l$ for a network with
$L$ layers, $H_0 = X_s$ and $b_l$ is a vector of bias parameters.
The matrices $W_l$ are Toeplitz matrices, and therefore each
hidden unit vector $H_l$ can be expressed as a regular
convolution between the kernel $w_{lpq}$ and the previous hidden unit
vector $H_{l-1}$:

$H_{lp} = \tanh\big(b_{lp} + \sum_{q \in \mathrm{parents}(p)} w_{lpq} * H_{l-1,q}\big)$. (1)

The filters $w_{lpq}$ and the biases $b_l$ constitute the trainable
parameters of our model, and are collectively denoted $\theta_s$.

Finally, the outputs of the $N$ networks are upsampled and
concatenated so as to produce $F$, a map of feature vectors
the size of $F_1$, which can be seen as local patch descriptors
and scene-level descriptors:

$F = [F_1, u(F_2), \dots, u(F_N)]$, (2)

where $u$ is an upsampling function.
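
The concatenation in Eq 2 can be illustrated with a short sketch. Several things here are assumptions for illustration only: `features` is a hypothetical stand-in for the shared-weight network that keeps the spatial resolution of its input, the pyramid is built by plain subsampling rather than a normalized Laplacian pyramid, and the upsampling function u is taken to be nearest-neighbor.

```python
import numpy as np

def upsample_nearest(fmap, factor):
    """u(.): nearest-neighbor upsampling of an (H, W, D) feature map."""
    return fmap.repeat(factor, axis=0).repeat(factor, axis=1)

def multiscale_features(image, features, n_scales=3):
    """Apply the same `features` function (shared parameters) at every
    scale and concatenate: F = [F_1, u(F_2), ..., u(F_N)].

    image:    (H, W, C) array, H and W divisible by 2**(n_scales - 1).
    features: callable mapping (h, w, C) -> (h, w, D); stand-in for f_s.
    """
    maps = []
    for s in range(n_scales):
        step = 2 ** s
        x_s = image[::step, ::step]        # crude pyramid level X_s (assumption)
        f_s = features(x_s)                # same function at each scale
        maps.append(upsample_nearest(f_s, step))
    return np.concatenate(maps, axis=-1)   # (H, W, n_scales * D)
```

Because one callable is reused at every scale, adding scales enlarges the context seen by each output vector without adding parameters, which is the point of the weight sharing discussed here.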

As mentioned above, weights are shared between
networks $f_s$. Intuitively, imposing complete weight sharing
across scales is a natural way of forcing the network to learn
scale-invariant features, and at the same time reduce the
chances of over-fitting. The more scales used to jointly train
the models $f_s(\theta_s)$, the better the representation becomes for
all scales. Because image content is, in principle, scale
invariant, using the same function to extract features at each
scale is justified. In fact, we observed a performance
decrease when removing the weight sharing.

3.2. Parameter-free hierarchical parsing 

Predicting the class of a given pixel from its own feature 
vector is difficult, and not sufficient in practice. The task is 
easier if we consider a spatial grouping of feature vectors 
around the pixel, i.e. a neighborhood. Among all possible 
neighborhoods, one is the most suited to predict the pixel's 
class. In Section 3.2.1 we propose to formulate the search
for the best-suited neighborhood as an optimization
problem. The construction of the cost function that is minimized
is then described in Section 3.2.2.

Figure 2. Finding the optimal cover. For each pixel (leaf) $i$, the
optimal component $C_{k^*(i)}$ is the one along the path between the
leaf and the root with minimal cost $S_{k^*(i)}$. The optimal cover is
the union of all these components. In this example, the optimal
cover $\{C_1, C_3, C_4, C_5\}$ will result in a segmentation in disjoint
sets $\{C_1, C_2, C_3, C_4\}$, with the subtle difference that component
$C_2$ will be labelled with the class of $C_5$, as $C_5$ is the best
observation level for $C_2$.



3.2.1 Optimal purity cover 

We define the neighborhood of a pixel as a connected
component that contains this pixel. Let $C_k$, $\forall k \in \{1, \dots, K\}$,
be the set of all possible connected components of the lattice
defined on image $I$, and let $S_k$ be a cost associated with each
of these components. For each pixel $i$, we wish to find the
index $k^*(i)$ of the component that best explains this pixel,
that is, the component with the minimal cost $S_{k^*(i)}$:

$k^*(i) = \operatorname{argmin}_{k \,|\, i \in C_k} S_k$. (3)

Note that the components $C_{k^*(i)}$ are non-disjoint sets that
form a cover of the lattice. Note also that the overall cost
$S^* = \sum_i S_{k^*(i)}$ is minimal.

In practice, the set of components Ck is too large, and 
only a subset of it can be considered. A classical technique 
to reduce the set of components is to consider a hierarchy of 
segmentations [18, 1,0], that can be represented as a tree T. 
Solving Eq 3 on T can be done simply by exploring the tree
in a depth-first search manner, and finding the component
with minimal weight along each branch. Figure 2 illustrates
the procedure.
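
A minimal sketch of this search, assuming the tree is given as a parent array with the pixels as the first `n_leaves` nodes (an encoding chosen here for illustration, not taken from the paper). The function name `optimal_cover` is hypothetical.

```python
def optimal_cover(parent, cost, n_leaves):
    """For each leaf (pixel) i, return k*(i): the node with minimal cost
    on the path from the leaf to the root (Eq 3 restricted to a tree).

    parent: parent[n] is the parent of node n, -1 for the root.
    cost:   cost[n] is the cost S_n of node n.
    """
    n = len(parent)
    children = [[] for _ in range(n)]
    root = -1
    for node, p in enumerate(parent):
        if p < 0:
            root = node
        else:
            children[p].append(node)

    best = [None] * n          # best[n]: min-cost node on the root..n path
    best[root] = root
    stack = [root]
    while stack:               # iterative depth-first traversal
        node = stack.pop()
        for ch in children[node]:
            # Keep the ancestor's choice unless the child is strictly cheaper.
            best[ch] = ch if cost[ch] < cost[best[node]] else best[node]
            stack.append(ch)
    return best[:n_leaves]
```

Each node is visited once, so the search is linear in the number of tree nodes. For example, with `parent = [4, 4, 5, 5, 6, 6, -1]` and `cost = [0.9, 0.8, 0.1, 0.7, 0.2, 0.5, 0.6]`, the four leaves are assigned to nodes 4, 4, 2 and 5 respectively.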



3.2.2 Producing the confidence costs 

Given a set of components $C_k$, we explain how to produce
all the confidence costs $S_k$. These costs represent the class
purity of the associated components. Given the groundtruth
segmentation, we can compute the cost as being the entropy
of the distribution of classes present in the component. At
test time, when no groundtruth is available, we need to
define a function that can predict this cost by simply looking
at the component. We now describe a way of achieving this,
as illustrated in Figure 3.

Figure 3. The shape-invariant attention function a. For each
component $C_k$ in the segmentation tree $T$, the corresponding image
segment is encoded by a spatial grid of feature vectors that fall
into this segment. The aggregated feature vector of each grid cell is
computed by a component-wise max pooling of the feature vectors
centered on all the pixels that fall into the grid cell; this produces a
scale-invariant representation of the segment and its surroundings.
The result, $O_k$, is a descriptor that encodes spatial relations
between the underlying object's parts. The grid size was set to 5 x 5
for all our experiments.

Given the scale-invariant features $F$, we define a compact
representation to describe objects as an elastic spatial
arrangement of such features. In other terms, an object, or
category in general, can be best described as a spatial
arrangement of features, or parts. A simple attention function
a is used to mask the feature vector map with each
component $C_k$, producing a set of $K$ masked feature vector
patterns $\{F \cap C_k\}$, $\forall k \in \{1, \dots, K\}$. The function a is
called an attention function because it suppresses the
background around the component being analyzed. The patterns
$\{F \cap C_k\}$ are resampled to produce fixed-size
representations. In our model the sampling is done using an elastic
max-pooling function, which remaps input patterns of
arbitrary size into a fixed $G \times G$ grid. This grid can be seen as a
highly invariant representation that encodes spatial relations
between an object's attributes/parts. This representation is
denoted $O_k$. Some nice properties of this encoding are:
(1) elongated, or in general ill-shaped objects, are nicely
handled; (2) the dominant features are used to represent the
object; combined with background subtraction, the pooled
features provide a solid basis for recognizing the
underlying object.
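
The masking and elastic max-pooling step can be sketched as below. This is an illustration under stated assumptions rather than the authors' code: the grid is stretched over the bounding box of the component, and grid cells that receive no component pixel are filled with zeros. The name `elastic_max_pool` is hypothetical.

```python
import numpy as np

def elastic_max_pool(features, mask, grid=5):
    """Pool the feature vectors of one component into a grid x grid descriptor.

    features: (H, W, D) float feature map F.
    mask:     (H, W) boolean array, True inside the component.
    Returns an array of shape (grid, grid, D); cells that contain no
    masked pixel are set to 0 (assumption).
    """
    ys, xs = np.nonzero(mask)
    y0, y1 = ys.min(), ys.max() + 1      # bounding box of the component
    x0, x1 = xs.min(), xs.max() + 1
    # Map each masked pixel to a cell of the grid stretched over the box.
    cy = (ys - y0) * grid // (y1 - y0)
    cx = (xs - x0) * grid // (x1 - x0)
    out = np.full((grid, grid, features.shape[-1]), -np.inf)
    for y, x, gy, gx in zip(ys, xs, cy, cx):
        # Component-wise max over all pixels falling into the cell.
        out[gy, gx] = np.maximum(out[gy, gx], features[y, x])
    out[np.isinf(out)] = 0.0             # empty cells (background) -> 0
    return out
```

Because the grid adapts to the bounding box, a small and a large segment both yield a descriptor of the same size, which is what lets a single classifier score every node of the tree.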

Once we have the set of object descriptors $O_k$, we define
a function $c : O_k \to [0, 1]^{N_c}$ (where $N_c$ is the number of
classes) as predicting the distribution of classes present in
component $C_k$. We associate a cost $S_k$ with this distribution.
In this paper $c$ is implemented as a simple 2-layer neural
network, and $S_k$ is the entropy of the predicted distribution.
More formally, let $x_k$ be the feature vector associated with
$O_k$, $\hat{d}_k$ the predicted class distribution, and $S_k$
the cost associated with this distribution. We have

$y_k = W_2 \tanh(W_1 x_k + b_1)$, (4)

$\hat{d}_{k,a} = \frac{e^{y_{k,a}}}{\sum_{b \in \mathrm{classes}} e^{y_{k,b}}}$, (5)

$S_k = -\sum_{a \in \mathrm{classes}} \hat{d}_{k,a} \log(\hat{d}_{k,a})$. (6)

Matrices $W_1$ and $W_2$ are denoted $\theta_c$, and represent the
trainable parameters of $c$. These parameters need to be learned
over the complete set of hierarchies, computed on the
entire training set available. The exact training procedure is
described in Section 4.
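
Eqs 4 to 6 translate directly into a few lines of NumPy. The sketch below is illustrative: the function name `purity_cost` is hypothetical, the max-shift in the softmax and the small constant inside the logarithm are numerical safeguards added here, and the entropy is measured in nats.

```python
import numpy as np

def purity_cost(x_k, W1, b1, W2):
    """Return (d_hat, S_k) for one flattened descriptor x_k (Eqs 4-6)."""
    y = W2 @ np.tanh(W1 @ x_k + b1)               # Eq 4
    e = np.exp(y - y.max())                       # shift for numerical stability
    d_hat = e / e.sum()                           # Eq 5 (softmax)
    s_k = -np.sum(d_hat * np.log(d_hat + 1e-12))  # Eq 6 (entropy)
    return d_hat, float(s_k)
```

A uniform prediction over $N_c$ classes gives the maximal cost $\log N_c$, while a prediction concentrated on one class gives a cost close to 0, so low cost corresponds to a "pure" segment.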

4. Training procedure 

Let $\mathcal{F}$ be the set of all feature maps in the training set,
and $\mathcal{T}$ the set of all hierarchies. Training the model
described in Section 3 can be done in two steps. First, we train
the low-level feature extractor $f_s$ in complete independence
of the rest of the model. The goal of that first step is to
produce features $(F)_{F \in \mathcal{F}}$ that are maximally discriminative for
pixelwise classification. Next, we construct the hierarchies
$(T)_{T \in \mathcal{T}}$ on the entire training set and, for all $T \in \mathcal{T}$, train
the classifier $c$ to predict the distribution of classes in each
component $C_k \in T$, as well as the costs $S_k$. Once this second
part is done, all the functions in Figure 1 are defined, and
inference can be performed on arbitrary images. In the next
two sections we describe these two steps.

4.1. Learning discriminative scale-invariant features

As described in Section 3.1, feature vectors in $F$ are
obtained by concatenating the outputs of multiple networks
$f_s$, each taking as input a different image in a multiscale
pyramid. Ideally a linear classifier should produce the
correct categorization for all pixel locations $i$, from the feature
vectors $F_i$. We train the parameters $\theta_s$ to achieve this goal.
Let $c_i$ be the true target vector for pixel $i$ and $\hat{c}_i$ be the
normalized prediction from the linear classifier; we set:

$L_{cat} = \sum_{i \in \mathrm{pixels}} l_{cat}(\hat{c}_i, c_i)$, (7)

$l_{cat}(\hat{c}_i, c_i) = -\sum_{a \in \mathrm{classes}} c_{i,a} \ln(\hat{c}_{i,a})$, (8)

$\hat{c}_{i,a} = \frac{e^{w_a^T F_i}}{\sum_{b \in \mathrm{classes}} e^{w_b^T F_i}}$, (9)

where $w_a$ is the weight vector of the linear classifier for class $a$.

The elementary loss function $l_{cat}(\hat{c}_i, c_i)$ in Eq 7 is chosen
to penalize the deviation of the multiclass prediction $\hat{c}_i$ from
the target vector $c_i$. In this paper, we use the multiclass
cross entropy loss function. In order to use this loss
function, we compute a normalized predicted probability
distribution over classes $\hat{c}_{i,a}$ using the softmax function in Eq 9.
The cross entropy between the predicted class distribution
and the target class distribution at a pixel location $i$ is then
measured by Eq 8. The true target probability $c_{i,a}$ of class
$a$ to be present at location $i$ can either be a distribution of
classes at location $i$ in a given neighborhood or a hard
target vector: $c_{i,a} = 1$ if pixel $i$ is labeled $a$, and 0 otherwise.
For training maximally discriminative features, we use hard
target vectors in this first stage. Once the parameters $\theta_s$ are
trained, we discard the classifier in Eq 9.
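
The loss of Eqs 7 to 9 with hard targets can be sketched as follows. The function name `pixelwise_cross_entropy` and the row-per-class layout of the weight matrix `W` are assumptions made for this illustration; the max-shift is a standard numerical safeguard.

```python
import numpy as np

def pixelwise_cross_entropy(F, W, labels):
    """L_cat of Eq 7 for hard target vectors.

    F:      (n_pixels, D) feature vectors F_i.
    W:      (n_classes, D) linear classifier weights, one row per class.
    labels: (n_pixels,) integer class index of each pixel.
    """
    scores = F @ W.T
    scores = scores - scores.max(axis=1, keepdims=True)   # stable softmax
    # Logarithm of the softmax in Eq 9.
    log_c_hat = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # With hard targets, Eq 8 reduces to -ln(c_hat) at the labeled class.
    return float(-log_c_hat[np.arange(len(labels)), labels].sum())
```

With a one-hot target the sum over classes in Eq 8 keeps a single term, which is why the code only indexes the log-probability of the labeled class.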

4.2. Teaching a classifier to find its best observation level

Given the trained parameters $\theta_s$, we build $\mathcal{F}$ and $\mathcal{T}$, i.e.
we compute all the vector maps $F$ and the hierarchies $T$
on all the training data available, so as to produce a new
training set of descriptors $O_k$. This time, the parameters $\theta_c$
of the classifier $c$ are trained to minimize the KL-divergence
between the true (known) distributions of labels $d_k$ in each
component, and the prediction from the classifier $\hat{d}_k$ (Eq 5):

$l_{div} = \sum_{a \in \mathrm{classes}} d_{k,a} \ln\left(\frac{d_{k,a}}{\hat{d}_{k,a}}\right)$. (10)

In this setting, the groundtruth distributions $d_k$ are not
hard target vectors, but normalized histograms of the
labels present in component $C_k$. Once the parameters $\theta_c$ are
trained, $\hat{d}_k$ accurately predicts the distribution of labels, and
Eq 6 can be used to assign a purity cost to the component.
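
A short sketch of the training target and loss of Eq 10. The helper names `label_histogram` and `kl_divergence` are hypothetical, and the small `eps` guarding the division is an addition for numerical safety; classes absent from the component contribute nothing to the sum.

```python
import numpy as np

def label_histogram(labels_in_component, n_classes):
    """Normalized histogram d_k of ground-truth labels inside one component."""
    counts = np.bincount(labels_in_component, minlength=n_classes)
    return counts / counts.sum()

def kl_divergence(d, d_hat, eps=1e-12):
    """l_div of Eq 10; terms with d_a = 0 contribute zero."""
    d = np.asarray(d, dtype=float)
    d_hat = np.asarray(d_hat, dtype=float)
    nz = d > 0
    return float(np.sum(d[nz] * np.log(d[nz] / (d_hat[nz] + eps))))
```

For a component whose pixels carry labels 0, 0, 1, 2 out of four classes, the target histogram is (0.5, 0.25, 0.25, 0), and a uniform prediction incurs a loss of 0.5 ln 2.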

5. Experiments 

We report results on three standard datasets. (1) The
Stanford Background dataset, introduced in [ ] for
evaluating methods for semantic scene understanding. The
dataset contains 715 images chosen from other existing
public datasets so that all the images are outdoor scenes,
have approximately 320 x 240 pixels, and contain at least
one foreground object. We use the evaluation procedure
introduced in [ ], 5-fold cross validation: 572 images used
for training, and 142 for testing. (2) The SIFT Flow dataset,
as described in Liu et al. [ ]. This dataset is composed
of 2,688 images that have been thoroughly labeled by
LabelMe users. Liu et al. [ ] have split this dataset into 2,488
training images and 200 test images and used synonym
correction to obtain 33 semantic labels. We use this same
training/test split. (3) The Barcelona dataset, as described in
Tighe et al. [24], is derived from the LabelMe subset used
in [21]. It has 14,871 training and 279 test images. The test
set consists of street scenes from Barcelona, while the
training set ranges in scene types but has no street scenes from
Barcelona. Synonyms were manually consolidated by [ ]
to produce 170 unique labels.

For all experiments, we use a 2-stage convolutional
network. The input $I$, a 3-channel image, is transformed into a
16-dimension feature map, using a bank of 16 filters of size 7 x 7
followed by tanh units; this feature map is then pooled
using a 2 x 2 max-pooling layer; the second layer transforms
the 16-dimension feature map into a 64-dimension feature
map, each component being produced by a combination of
8 filters of size 7 x 7 (512 filters), followed by tanh units; the map
is pooled using a 2 x 2 max-pooling layer; finally the
64-dimension feature map is transformed into a 256-dimension
feature map, each component being produced by a
combination of 16 filters of size 7 x 7 (2048 filters).

The network is applied to a locally normalized Laplacian
pyramid constructed on the input image. For these
experiments, the pyramid consists of 3 rescaled versions of the
input ($N = 3$), in octaves: 320 x 240, 160 x 120, 80 x 60.
All inputs are properly padded, and the outputs of each of the
3 networks are upsampled and concatenated, so as to produce
a 256 x 3 = 768-dimension feature vector map $F$. The
network is trained on all 3 scales in parallel.

Simple grid-search was performed to find the best learn- 
ing rate and regularization parameters (weight decay), using 
a holdout of 10% of the training dataset for validation. More 
regularization was necessary to train the classifier c. For 
both datasets, jitter was used to artificially expand the size 
of the training data, and ensure that the features do not
overfit some irrelevant biases present in the data. Jitter includes
horizontal flipping of all images, and rotations between -8
and 8 degrees.

In this paper, the hierarchy used to find the optimal cover
is a simple hierarchy constructed on the raw image gradient,
based on a standard volume criterion [16, 4], completed by
a removal of non-informative small components (less than
100 pixels). Classically, segmentation methods find a
partition of the segments rather than a cover. Partitioning the
segments consists in finding an optimal cut in the tree (so
that each terminal node in the pruned tree corresponds to
a segment). We experimented with a number of graph cut
methods to do so, including graph-cuts [ , ], Kruskal [12]
and Power Watersheds [ ], but the results were
systematically worse than with our optimal cover method.

On the Stanford dataset, we report two experiments: a 
baseline system, based on the multiscale convolutional net- 
work alone; and the full model as described in Section 3. 
Results are reported in Table 1. On the two other datasets, 
we report results for our complete model only, in Tables 2 
and 3. Example parses on the SIFT Flow dataset are shown
in Figure 4.

Baseline, multiscale network: for our baseline, the
multiscale network is trained as a simple class predictor for
each location $i$, using the single classification loss $L_{cat}$
defined in Eq 7. With this simple system, the pixelwise
accuracy is surprisingly good, but the visual aspect of the
predictions clearly suffers from poor spatial consistency and poor
object delineation.



                          P / C             CT
Gould et al. [ ]          76.4% / -         10s to 10min
Munoz et al. [ ]          76.9% / 66.2%     12s
Tighe et al. [ ]          77.5% / -         10s to 5min
Socher et al. [ ]         78.1% / -         ?
Kumar et al. [ ]          79.4% / -
multiscale net            77.5% / 70.0%     0.5s
multiscale net + cover    79.5% / 74.3%

Table 1. Performance of our system on the Stanford Background
dataset [6]: per-pixel accuracy / average per-class accuracy (P / C). The
third column (CT) reports approximate compute times, as reported by
the authors. Note: we benchmarked our algorithms using a
modern 4-core Intel i7, which could give us an unfair advantage over
the competition.
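
The two accuracy figures reported in the tables can be computed as in the sketch below. The function name `parsing_accuracies` is hypothetical, and skipping classes that do not occur in the ground truth is an assumption made here; the paper does not state how such classes are handled.

```python
import numpy as np

def parsing_accuracies(pred, target, n_classes):
    """Return (per-pixel accuracy, average per-class accuracy).

    pred, target: integer label arrays of the same shape.
    Classes absent from `target` are skipped in the per-class average.
    """
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    correct = pred == target
    per_pixel = correct.mean()
    per_class = [correct[target == a].mean()
                 for a in range(n_classes) if np.any(target == a)]
    return float(per_pixel), float(np.mean(per_class))
```

The two measures can diverge sharply: if 8 of 10 pixels belong to class 0 and all are correct, while only 1 of the 2 pixels of class 1 is correct, the per-pixel accuracy is 90% but the average per-class accuracy is 75%.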



Complete system, network and hierarchy: in this
second experiment, we use the complete model, as described
in Section 3. The 2-layer neural network (Eq 4) has
3 x 3 x 3 x 256 = 6912 input units (using a 3 x 3 grid
of feature vectors from $F$) and 512 hidden units; 8 output
units are needed for the Stanford Background dataset, 33 for
the SIFT Flow dataset, and 170 for the Barcelona dataset.
Results are significantly better than the baseline method; in
particular, much better delineation is achieved.

For the SIFT Flow dataset, we experimented with two 
sampling methods when learning the multiscale features: 
respecting natural frequencies of classes, and balancing 
them so that an equal amount of each class is shown to 
the network. Both results are reported in Table 2. Train- 
ing with balanced frequencies allows better discrimination 
of small objects, and although it decreases the overall pixel- 
wise accuracy, it is more correct from a recognition point of 
view. Frequency balancing was used on the Stanford Back- 
ground dataset, as it consistently gave better results. For 
the Barcelona dataset, both sampling methods were used as 
well, but frequency balancing worked rather poorly in that 
case. This could be explained by the fact that this dataset
has a large number of classes with very few training
examples. These classes are therefore extremely hard to model,
and overfitting occurs much faster than for the SIFT Flow
dataset. Results are shown in Table 3.

Results in Table 1 also demonstrate the impressive com- 
putational advantage of convolutional networks over com- 
peting algorithms. Training time is also remarkably fast: 
results on the Stanford Background dataset were typically 
obtained in 24h on a regular server. 

Table 2. Performance of our system on the SIFT Flow dataset [15]:
per-pixel accuracy / average per-class accuracy. Our multiscale
network is trained using two sampling methods: natural
frequencies and balanced frequencies.

Table 3. Performance of our system on the Barcelona dataset [24]:
per-pixel accuracy / average per-class accuracy. Our multiscale
network is trained using two sampling methods: natural
frequencies and balanced frequencies.

6. Discussion

We introduced a discriminative framework for learning
to identify and delineate objects in a scene. Our model does
not rely on engineered features, and uses a multi-scale
convolutional network operating on raw pixels to learn
appropriate low-level and mid-level features. The convolutional
network is trained in supervised mode to directly produce 
labels. Unlike many other scene parsing systems that rely 
on expensive graphical models to ensure consistent label- 
ings, our system relies on a segmentation tree in which the 
nodes (corresponding to image segments) are labeled with 
the entropy of the distribution of classes contained in the 
corresponding segment. Instead of graph cuts or other in- 
ference methods, we use the new concept of optimal cover 
to extract the most consistent segmentation from the tree. 

The complexity of each operation is linear in the num- 
ber of pixels, except for the production of the tree, which 
is quasi-linear (meaning cheap in practice). The system 
produces state-of-the-art accuracy on the Stanford Back- 
ground, SIFT Flow, and Barcelona datasets (both measured 
per pixel, or averaged per class), while dramatically outper- 
forming competing models in inference time. 

Our current system relies on a single segmentation tree 
constructed from image gradients, and implicitly assumes 
that the correct segmentation is contained in the tree. Fu- 
ture work will involve searches over multiple segmentation 
trees, or will use other graphs than simple trees to encode 
the possible segmentations (since our optimal cover algo- 
rithm can work from other graphs than trees). Other direc- 
tions for improvements include the use of structured learn- 
ing criteria such as Turaga et al.'s Maximin Learning [ ] to 
learn low-level feature vectors from which better segmenta- 
tion trees can be produced. 